System and method for exploiting scene graph information in construction of an encoded video sequence

ABSTRACT

A system, method, and computer program product for creating a composited video frame sequence for an application. A current scene state for the application is compared to a previous scene state, wherein each scene state includes a plurality of objects. A video construction engine determines if properties of one or more objects have changed based upon a comparison of the scene states. If properties of one or more objects have changed based upon the comparison, the delta between the object's states is determined and this information is used by a fragment encoding module if the fragment has not been encoded before. The information is used to define, for example, the motion vectors for use by the fragment encoding module in construction of the fragments to be used by the stitching module to build the composited video frame sequence.

RELATED APPLICATIONS

This application is a continuation-in-part of U.S. patent application Ser. No. 13/911,948, filed Jun. 6, 2013. This prior application is incorporated by reference herein in its entirety.

TECHNICAL FIELD

The present invention relates to the creation of an encoded video sequence, and more particularly to using scene graph information for encoding the video sequence.

BACKGROUND ART

It is known in the prior art to encode and transmit multimedia content for distribution within a network. For example, video content may be encoded as MPEG video wherein pixel domain data is converted into a frequency domain representation, quantized and entropy encoded and placed into an MPEG stream format. The MPEG stream can then be transmitted to a client device and decoded and returned to the spatial/pixel domain for display on a display device.

The encoding of the video may be spatial, temporal or a combination of both. Spatial encoding generally refers to the process of intra-frame encoding wherein spatial redundancy (information) is exploited to reduce the number of bits that represent a spatial location. Spatial data is converted into a frequency domain over a small region. In general, for small regions it is expected that the data will not drastically change, and therefore much of the information will be stored at DC and low frequency components with the higher frequency components being at or near zero. Thus, the lack of high frequency information in a small area is used to reduce the representative data size. Data may also be compressed using temporal redundancy. One method for exploiting temporal redundancy is through the calculation of motion vectors. Motion vectors establish how objects or pixels move between frames of video. For example, a ball may move between a first frame and a second frame by a number of pixels in a given direction. Once a motion vector is calculated, the information about the spatial relocation of the ball from the first frame to the second frame can be used to reduce the amount of information that is used to represent the motion in an encoded video sequence. Note that in practical applications the motion vector is rarely a perfect match and an additional residual is sometimes used to compensate for the imperfect temporal reference.

Motion vector calculation is a time consuming and processor intensive step in compressing video content. Typically, a motion search algorithm is employed to attempt to match elements within the video frames and to define motion vectors that point to the new location of objects or portions of objects. This motion search algorithm compares macroblocks (i.e., tries to find for each macroblock the optimal representation of that macroblock in past and future reference frames according to a certain criterion), and determines the vector to represent that temporal relation. The motion vector is subsequently used (i.e., to minimize the residual that needs to be compressed) in the compression process. It would be beneficial if a mechanism existed that assists in the determination of these motion vectors.
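By way of illustration only, the cost of a conventional motion search can be seen in the following minimal sketch of an exhaustive block-matching search. The sum of absolute differences (SAD) criterion, the 4×4 block size, the search range and the helper names (sad, motion_search, make_frame) are assumptions chosen for brevity and are not taken from any particular encoder.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between a block in cur and a block in ref."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(cur[cy + dy][cx + dx] - ref[ry + dy][rx + dx])
    return total


def motion_search(cur, ref, cx, cy, size=4, search_range=2):
    """Try every offset in the search window; return ((mvx, mvy), cost)."""
    height, width = len(ref), len(ref[0])
    best = None
    for mvy in range(-search_range, search_range + 1):
        for mvx in range(-search_range, search_range + 1):
            rx, ry = cx + mvx, cy + mvy
            # Skip candidates that fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(cur, ref, cx, cy, rx, ry, size)
            if best is None or cost < best[1]:
                best = ((mvx, mvy), cost)
    return best


def make_frame(width, height, x0, y0, size=4):
    """Hypothetical test frame: black background with one textured square."""
    frame = [[0] * width for _ in range(height)]
    for dy in range(size):
        for dx in range(size):
            frame[y0 + dy][x0 + dx] = 50 + 10 * dy + dx
    return frame


reference = make_frame(12, 12, 2, 3)   # square at (2, 3) in the earlier frame
current = make_frame(12, 12, 4, 4)     # square has moved to (4, 4)
vector, cost = motion_search(current, reference, 4, 4)
```

In this sketch the search evaluates up to (2 × search_range + 1)² candidate positions per block, each costing size² pixel comparisons. That per-block work is what a mechanism supplying the vector directly would avoid.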

As appreciated by those skilled in the art, another expensive component of the encoding process for more advanced codecs is the process to find the optimal macroblock type, partitioning of the macroblock and the weighting properties of the slice. H.264, for example, has 4 16×16, 9 8×8 and 9 4×4 luma intra prediction modes and 4 8×8 chroma intra prediction modes, and inter macroblocks can be partitioned from as coarse as 16×16 to as fine grained as 4×4. In addition, it is possible to assign a weight and offset to the temporal references. A mechanism that defines or assists in finding these parameters directly improves scalability.

SUMMARY OF THE EMBODIMENTS

In a first embodiment of the invention there is provided a method for creating a composited video frame sequence for an application wherein the video frame sequence is encoded according to a predetermined specification, such as MPEG-2, H.264 or other block based encoding protocol or variant thereof. A current scene state for the application is compared to a previous scene state wherein each scene state includes a plurality of objects. A video construction module determines if properties of one or more objects have changed (such as, but not limited to, the object's position, transformation matrix, texture, translucency, etc.) based upon a comparison of the scene states. If properties of one or more objects have changed, the delta between the object's states is determined and this is used by a fragment encoding module in case the fragment is not already available in a fragment caching module. This information is used to define, for example, the motion vectors used by the fragment encoding module in the construction of the fragments from which the stitching module builds the composited video frame sequence.

In certain embodiments of the invention, the information about the changes in the scene's state can also be used to decide whether a macroblock is to be encoded spatially (using an intra encoded macroblock) or temporally (using an inter encoded macroblock) and, given a certain encoding, what the optimal partitioning of the macroblock is. In certain embodiments, the information about the changes in the scene's state may also assist in finding the optimal weight and offset of the temporal reference in order to minimize the residual. The benefit of using scene state information in the encoding process is a gain in efficiency with respect to the resources required to encode the fragments, as well as an improvement in the visual quality of the encoded fragments or a reduction in their size, because spatial relations in the current scene state or temporal relations between the previous scene state and current scene state can be more accurately determined.
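As a purely illustrative sketch of how a weight and offset might follow from scene state, consider an object whose opacity changes between two scene states while it is blended over a uniform background. The blending model, the uniform background and the function names (blend, fade_weight_and_offset) are assumptions made for this example. The mapping of the resulting values onto the integer weighted prediction syntax of a particular codec is not shown.

```python
def blend(obj, background, alpha):
    """Alpha-blend one object sample over a background sample."""
    return alpha * obj + (1.0 - alpha) * background


def fade_weight_and_offset(alpha_prev, alpha_cur, background):
    """Return (weight, offset) such that cur = weight * prev + offset.

    Derived from prev = a_p*obj + (1 - a_p)*bg and cur = a_c*obj + (1 - a_c)*bg.
    Assumes alpha_prev is non-zero and the background is uniform.
    """
    weight = alpha_cur / alpha_prev
    offset = background * ((1.0 - alpha_cur) - weight * (1.0 - alpha_prev))
    return weight, offset


object_pixels = [200.0, 120.0, 64.0]    # hypothetical object samples
background = 40.0                       # hypothetical uniform background
previous = [blend(p, background, 0.8) for p in object_pixels]
current = [blend(p, background, 0.4) for p in object_pixels]

weight, offset = fade_weight_and_offset(0.8, 0.4, background)
predicted = [weight * p + offset for p in previous]
residual = [c - p for c, p in zip(current, predicted)]
```

Under these assumptions the weighted reference reproduces the current samples, so the residual is zero up to floating point rounding and no search over candidate weights is required.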

Some embodiments of the invention may maintain objects in a 2 dimensional coordinate system, 2 dimensional (flat) objects in a 3 dimensional coordinate system or a full 3 dimensional object model in a 3 dimensional coordinate system. The objects may be kept in a hierarchical structure, such as a scene graph. Embodiments may use additional 3 dimensional object or scene properties known to the trade, such as, but not limited to, perspective, lighting effects, reflection, refraction, fog, etc.

In other embodiments, in order to determine the motion information, the current scene graph state and the previous scene graph state may be converted from a three dimensional representation into a two dimensional representation. The three dimensional representation may be for a worldview of the objects to be rendered and displayed on a display device. The two dimensional representation may be a screen view for displaying the objects on a display device. In addition to the motion information, in general there will be residual graphical information because the edges of moving objects generally do not map exactly on macroblock boundaries or objects are partially translucent, overlay or have quantization effects etc.

Embodiments of the invention may construct an MPEG encoded video sequence using the motion information including the corresponding motion vectors and residual graphical information that can be encoded. The scene states (previous and current) may result as the output of an application engine such as an application execution engine. The application execution engine may be a web browser, a script interpreter, operating system or other computer-based environment that is accessed during operation of the application. The application execution engine may interface with the described system using a standardized API (application programming interface), such as for example OpenGL. The system may translate the scene representation as expressed through the used API to a convenient internal representation or directly derive state changes from the API's primitives.

The current scene graph state includes a plurality of objects having associated parameters. Some examples of parameters are the location of objects to be rendered, lighting effects, textures, and other graphical characteristics that may be used in rendering the object(s). A hash may be created for objects within a scene. The hash may be compared to a table of hashes that represent objects from previous scenes. If the current hash matches a hash within the table of hashes, MPEG encoded elements for the identified object are retrieved. The MPEG encoded elements can then be sent to a stitcher that can stitch together the MPEG encoded elements to form one or more MPEG encoded video frames in a series of MPEG encoded video frames.

In order to create the hash for the objects, the scene graph state is converted to a 2D or display representation. It is then determined which non-overlapping rectangles of the display represent state changes of the scene graph state. A hash is created for each rectangle, i.e. object; the previous and current state of the objects within these rectangles is hashed. These hashes are compared to hashes available in the table of hashes.
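One possible way to compute such a hash is sketched below. The description does not prescribe a hash algorithm or a serialization, so the use of SHA-256 over a canonical JSON encoding, the function name rectangle_hash and the property names in the example are all assumptions.

```python
import hashlib
import json


def rectangle_hash(rect, previous_state, current_state):
    """Hash a rectangle together with the previous and current object state.

    rect is [x, y, width, height]; each state is a list of property dicts.
    Keys are sorted so that equal states hash equally regardless of key order.
    """
    payload = {"rect": rect, "previous": previous_state, "current": current_state}
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical button changing from an off-state to an on-state.
off_state = [{"id": "button", "texture": "off.png", "opacity": 1.0}]
on_state = [{"id": "button", "texture": "on.png", "opacity": 1.0}]
on_state_reordered = [{"opacity": 1.0, "texture": "on.png", "id": "button"}]

h1 = rectangle_hash([32, 48, 64, 32], off_state, on_state)
h2 = rectangle_hash([32, 48, 64, 32], off_state, on_state_reordered)
h3 = rectangle_hash([32, 48, 64, 32], on_state, off_state)
```

Here h1 and h2 are equal because the states are equivalent, while h3 differs because the transition runs in the opposite direction.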

If the current hash does not match a hash in the table and no motion information can be determined by the scene graph state comparison for an object, the spatial data from the hashing process, in which the object is converted from a three dimensional representation to a two dimensional screen representation, is provided to an encoder, wherein the encoder compresses the data using at least spatial techniques to produce one or more encoded elements. The encoder may encode according to a predetermined protocol such as MPEG, H.264 or another block based encoding protocol. The encoded elements are passed to a stitching module. The stitching module forms an encoded MPEG frame from the encoded elements where the encoded MPEG frame is part of an MPEG video sequence.

The methodology may be embodied as a computer program product where the computer program product includes a non-transitory computer readable medium having computer code thereon for creating an encoded video sequence. The above-described method may be embodied as a system that includes one or more processors that perform specified functions in the creation of the encoded video sequence.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing features of embodiments will be more readily understood by reference to the following detailed description, taken with reference to the accompanying drawings, in which:

FIG. 1 shows a detailed embodiment showing components that are used in processing application environment data and constructing an encoded video sequence from the data;

FIG. 2 shows a flow chart for implementing the functionality of relevant components of an embodiment of the invention;

FIG. 3 shows an environment for implementing the present invention;

FIG. 4 shows an exemplary screen shot of an application;

FIG. 5 shows a representative DOM tree for the application of FIG. 4;

FIG. 6 shows an exemplary scene graph model of the image of FIG. 4;

FIG. 7 shows a scene graph state with associated screen position information;

FIG. 8 shows a previous scene graph state and a current scene graph state;

FIG. 9 shows a motion field between a first scene graph state and a second scene graph state;

FIG. 10 shows a motion field for the rotation of each macroblock of an image;

FIG. 11 shows typical embodiments of the invention;

FIG. 12 shows a flow chart for implementing the functionality of relevant components of an embodiment of the invention; and

FIGS. 13 and 14 demonstrate the tessellation process.

DETAILED DESCRIPTION OF SPECIFIC EMBODIMENTS

Definitions. As used in this description and the accompanying claims, the following terms shall have the meanings indicated, unless the context otherwise requires:

The term “application” refers to an executable program, or a listing of instructions for execution, that defines a graphical user interface (“GUI”) for display on a display device. An application may be written in a declarative language such as HTML or CSS, a procedural language such as C, JavaScript, or Perl, any other computer programming language, or a combination of languages.

“Application execution environment” is an environment that receives an application, including all of its components, manages the components and their execution to define a graphical layout, and manages the interactions with the graphical layout. For example, Trident, WebKit, and Gecko are software layout engines that convert web pages into a collection of graphical objects (text strings, images, and so on) arranged, according to various instructions, within a page display area of a web browser. The instructions may be static, as in the case of parts of HTML, or dynamic, as in the case of JavaScript or other scripting languages, and the instructions may change as a function of user input. Trident is developed by Microsoft Corporation and used by the Internet Explorer web browser; WebKit is developed by a consortium including Apple, Nokia, Google and others, and is used by the Google Chrome and Apple Safari web browsers; Gecko is developed by the Mozilla Foundation, and is used by the Firefox web browser. Operating systems such as Google's Android and Apple's iOS may be considered application execution environments because these operating systems can execute applications. The output of an application execution environment is a screen state (either absolute or relative to a previous screen state). The screen state may be presented as a scene graph state.

“Rendering Engine” transforms a model of an image to actual data that can generate the image on a display device. The model of the image may contain two-dimensional or three-dimensional data as would be represented in a world space and the rendering engine takes the data and transforms the data into a screen-space representation wherein the data may be represented as pixels.

“Video Construction Module” compares scene states and derives which areas on the display device need to be changed. The video construction module determines how to map the changed areas to encoded fragments that can be stitched together and maintains a cache of already encoded fragments. If fragments are not available in an encoded form, the video construction module interacts with a fragment encoding module to encode the fragment.

“Fragment Caching Module” stores fragments in volatile memory (such as for example the system's RAM) or persistent memory (such as for example a disc based file system).

“Encoding Engine/Fragment Encoding Module” transforms graphical data and associated information about spatial and/or temporal relations into one or more encoded fragments.

“Stitching Engine/Module” receives as input one or more fragments (e.g., MPEG encoded elements) along with layout information and then constructs complete video frames for a video sequence (e.g. MPEG video frames for an MPEG elementary stream).

“Scene” is a model of an image generated by an application execution engine consisting of objects and their properties.

“Scene state” is the combined state of all objects and their properties at a particular moment in time.

“Scene Graph” is a specialized scene where objects have a hierarchical relation.

“Scene Graph State” is the combined state of all objects and their properties of a scene graph at a particular moment in time.

“API” (application programming interface) is an interaction point for software modules, providing functions, data structures and object classes with the purpose of using provided services in software modules.

“DOM” (document object model) is a convention for representing and interacting with objects in markup languages such as HTML and XML documents.

“DOM tree” is a representation of a DOM (document object model) for a document (e.g. an HTML file) having nodes wherein the topmost node is the document object.

“CSS” (cascading style sheets) provide the graphical layout information for a document (e.g. an HTML document) and how each object or class of objects should be represented graphically. The combination of a DOM object and the corresponding CSS files (i.e. layout) is referred to as a rendering object.

“Render layer” is a graphical representation of one or more objects of a scene graph state. For example, a group of objects that have a geographical relationship such as an absolute or a relative position to each other may form a layer. An object may be considered to be a separate render layer if, for example, the object is transparent, has an alpha mask or has a reflection. A render layer may be defined by a screen area, such as a screen area that can be scrolled. A render layer may be designated for an area that may have an overlay (e.g. a pop up). A render layer may be defined for a portion of a screen area if that area is to have an applied graphical filter such as a blur, color manipulation or shadowing. A layer may be defined by a screen area that has associated video content. Thus, a render layer may be a layer within a scene graph state or a modification of a scene graph state layer in which objects are grouped according to a common characteristic.

“Fragment” is one or more MPEG-encoded macroblocks, as disclosed in U.S. patent application Ser. No. 12/443,571, filed Oct. 1, 2007, the contents of which are incorporated by reference in their entirety. A fragment may be intra-encoded (spatially-encoded), inter-encoded (temporally-encoded) or a combination thereof.

Embodiments of the present invention provide for the extraction of spatial information as well as other graphical information from an application execution environment by using software integration points that are (for example) intended for communication between the application execution environment and Graphical Processing Unit (GPU) driver software. This spatial information can then be used for the creation of motion vectors for encoding of graphical content in a frequency-based encoding format, such as MPEG, AVS, VC-1, H.264 and other block-based encoding formats and variants that employ motion vectors.

Embodiments of the invention use the motion information exposed by an Application Execution Environment's GPU interface (or another suitable interface that allows access to the scene graph state) to obtain spatial and temporal information of the screen objects to be rendered, and to use that information to more efficiently encode the screen objects into a stream of MPEG frames.

In order to determine the motion information, the application execution environment may access Z-ordering information from a scene graph for the rendering of objects. For example, the application execution environment can separate a background layer from a foreground image layer and the scene graph state may specify objects that are partially translucent. This information can be used to determine what information will be rendered from a 3-dimensional world view in a 2-dimensional screen view. Once the visible elements are determined, motion information can be determined and the motion information can be converted into motion vectors. Multiple motion vectors may be present for a particular screen area. For example, if two different layers (on different Z indices) are moving in different directions, the area would have different associated motion vectors. The encoder will determine a dominant vector given its knowledge of what is being rendered, including translucency, surface area of the moving object, texture properties (i.e. is it a solid or a pattern) etc.
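The description does not specify how the dominant vector is chosen. The following sketch shows one hypothetical heuristic that scores each candidate by its visible area, opacity and whether it is textured. The scoring formula, the texture_weight parameter and the LayerMotion type are assumptions introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class LayerMotion:
    vector: tuple          # (dx, dy) in pixels for this layer within the area
    covered_pixels: int    # visible surface area inside the screen area
    opacity: float         # 0.0 (fully transparent) to 1.0 (opaque)
    textured: bool         # True for a pattern, False for a solid fill


def dominant_vector(candidates, texture_weight=2.0):
    """Pick the vector of the layer with the highest heuristic score.

    Assumed heuristic: patterned content is weighted more heavily than solid
    content, because mispredicting a solid area costs little residual.
    """
    def score(c):
        weight = texture_weight if c.textured else 1.0
        return c.covered_pixels * c.opacity * weight

    return max(candidates, key=score).vector


background = LayerMotion(vector=(0, 0), covered_pixels=256, opacity=1.0, textured=False)
opaque_logo = LayerMotion(vector=(4, 0), covered_pixels=160, opacity=0.9, textured=True)
faint_logo = LayerMotion(vector=(4, 0), covered_pixels=160, opacity=0.3, textured=True)

chosen_opaque = dominant_vector([background, opaque_logo])
chosen_faint = dominant_vector([background, faint_logo])
```

With these example values the nearly opaque patterned layer dominates the static background, whereas the same layer at low opacity does not.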

FIG. 1 shows a detailed embodiment showing components that are used in processing application environment data and constructing an encoded video sequence from the data. The application environment data provides information about visual content to be rendered on a display device of a client. The data from an application execution environment 110 may be processed through one of a plurality of possible paths. The first path is a prior art path wherein the data from the application execution environment 110, which may be OpenGL library function calls, is passed to a hardware-based graphics accelerator 120 and presented on a display 130. In an alternative path, the data from the application execution environment 110 is passed to a video construction engine 170. The video construction engine 170 exploits information within the data from the application execution engine to improve the encoding process and reduce the number of calculations that need to be performed. This path will be explained in greater detail below with respect to embodiments of the invention.

FIG. 1 is now explained in more detail. An application is constructed in an application editor 100. The application editor 100 may be an integrated development environment (IDE) or a text editor, for example. The output of the application editor may include one or more sections. The application may be composed of one or more of the following: HTML (hypertext markup language) data, CSS (cascading style sheets) data, script(s) from various scripting languages such as JavaScript and Perl, program code, such as JAVA, for execution in an application execution environment and/or executable programs (*.exe). The components of the application may then be executed in an application execution environment 110 in response to a request for the application by a client device operating remotely from the application execution environment. An application execution environment receives the application including its various components and creates an output file that can be used for display on a display device of the client. For example, the application execution environment may create a program referencing a number of OpenGL library functions/objects. OpenGL is a specification that describes an abstract API for drawing 2D and 3D graphics and is known to one of ordinary skill in the art.

As shown, the Application Execution Engine 110 may produce an output for graphical processing. Examples of application execution environments include both computer software and hardware and combinations thereof for executing the application. Applications can be written for certain application execution environments including WebKit, JAVA compilers, script interpreters (Perl etc.) and various operating systems including iOS and Android OS for example.

The video construction engine 170 takes advantage of the data that it receives from the application execution environment in order to exploit redundancies in requests for the presentation of information within user sessions and between user sessions as well as determining motion changes of objects from a previous video frame or scene graph state to a current frame or scene graph state. The present system may be used in a networked environment wherein multiple user sessions are operational simultaneously wherein requested applications may be used by multiple users simultaneously.

The video construction engine 170 may receive OpenGL data and can construct a scene graph from the OpenGL data. The video construction engine 170 can then compare the current scene graph state to one or more previous scene graph states to determine if motion occurs between objects within the scene. If motion occurs between the objects, this motion can be translated into a motion vector and this motion vector information can be passed to an encoding module 150. Thus, the encoding module 150 need not perform a motion vector search and can add the motion vectors into the video frame format (e.g. MPEG video frame format). MPEG elements can then be constructed as encoded MPEG macroblocks that are inter-frame encoded. These macroblocks are passed to the stitching module 160 that receives stitching information about the video frame layout and stitches together encoded MPEG elements to form complete MPEG encoded video frames in accordance with the scene graph. Either simultaneously or in sequence, the MPEG video construction engine may hash the parameters for objects within the scene graph according to a known algorithm. The construction engine 170 will compare the hash value to hash values of objects from previous scene graphs and if there is a match within the table of hashes, the construction engine 170 will locate MPEG encoded macroblocks (MPEG elements) that are stored in memory and are related to the hash. These MPEG elements can be passed directly to the stitching engine 160 wherein the MPEG elements are stitched together to form complete MPEG encoded video frames. Thus, the output of the stitching module 160 is a sequence of encoded video frames that contain both intra-frame encoded macroblocks and inter-frame encoded macroblocks. Additionally, the video construction engine 170 outputs pixel based information to the encoder. This pixel-based information may be encoded using spatial based encoding algorithms including the standard MPEG DCT processes. This pixel based information occurs as a result of changes in the scene (visual display) in which objects represented by rectangles are altered. The encoded macroblocks can then be passed to the stitcher. The processes of the video construction engine 170 will be explained in further detail with respect to the remaining figures.

FIG. 2 shows a flow chart for implementing the functionality of relevant components of an embodiment of the invention. A user of the system at a client device interacts with the application through the application execution engine. The user makes a request for content through a key press or other input that generates a control signal that is transmitted from the client device to the application execution engine that indicates that there should be a screen update of one or more screen elements (e.g. rectangles). Thus, the rectangles to be updated can be defined as a dirty rectangle that will need either to be retrieved from memory if the dirty rectangle has previously been rendered and encoded or provided to an encoder. The encoder may receive motion vector information, which will avoid motion vector calculations and the encoder may receive spatial data for dirty rectangles, which need to be spatially encoded.

The application execution engine may be proximate to the client device, operational on the client device, or may be remote from the client device, such as in a networked client/server environment. The control signal for the dirty rectangle causes the application execution engine to generate a scene graph having a scene graph state that reflects the changes to the screen (e.g. dirty rectangles of the screen display). For example, the application execution environment may be a web browser operating within an operating system. The web browser represents a page of content in a structured hierarchical format such as a DOM and corresponding DOM tree. Associated with the DOM tree is a CSS that specifies where and how each object is to be graphically rendered on a display device. The web browser creates an output that can be used by a graphics engine. The output that is produced is the scene graph state which may have one or more nodes and objects associated with the nodes forming a layer (i.e. a render layer) 200. As requests occur from a client device for updates or updates are automatically generated as in a script, a new or current scene graph state is generated. Thus, the current scene graph state represents a change in the anticipated output video that will be rendered on a display device. An exemplary scene graph state is shown in FIG. 6 described below.

Once the current scene graph state is received by the video construction engine 200, the scene graph state can be compared with a previous scene graph state 210. The comparison of scene graph states can be performed hierarchically by layer and by object. For each object associated with a node, differences in the positions of objects from the scene graph states can be identified as well as differences in characteristics, such as translucence and lighting.
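A hierarchical comparison of this kind may be sketched as follows. The nested dictionary representation (layer, then object, then property), the function name diff_scene_states and the property names in the example are assumptions made for illustration and do not reflect a required data structure.

```python
def diff_scene_states(previous, current):
    """Compare two scene graph states layer by layer and object by object.

    Each state maps layer name -> object id -> property dict.
    Returns a list of (layer, object_id, property, old_value, new_value).
    """
    changes = []
    for layer in sorted(set(previous) | set(current)):
        prev_objs = previous.get(layer, {})
        cur_objs = current.get(layer, {})
        for obj_id in sorted(set(prev_objs) | set(cur_objs)):
            if obj_id not in prev_objs:
                changes.append((layer, obj_id, "added", None, cur_objs[obj_id]))
            elif obj_id not in cur_objs:
                changes.append((layer, obj_id, "removed", prev_objs[obj_id], None))
            else:
                old, new = prev_objs[obj_id], cur_objs[obj_id]
                for prop in sorted(set(old) | set(new)):
                    if old.get(prop) != new.get(prop):
                        changes.append((layer, obj_id, prop, old.get(prop), new.get(prop)))
    return changes


previous_state = {
    "background": {"sky": {"position": (0, 0), "translucence": 0.0}},
    "foreground": {"circle": {"position": (10, 20), "translucence": 0.0}},
}
current_state = {
    "background": {"sky": {"position": (0, 0), "translucence": 0.0}},
    "foreground": {"circle": {"position": (26, 20), "translucence": 0.5}},
}
changes = diff_scene_states(previous_state, current_state)
```

In this example only the circle in the foreground layer is reported, once for its changed position and once for its changed translucence. The unchanged background layer produces no entries.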

For example, in a simple embodiment, a circle may be translated by a definable distance between the current scene graph state and a previous scene graph state. The system queries whether one or more objects within the scene graph state have moved. If one or more objects have been identified as moving between scene graph states, information about the motion translation is determined 220. This information may require the transformation of position data from a three dimensional world coordinate view to a two dimensional screen view so that pixel level motion (two dimensional motion vectors) can be determined. This motion information can then be passed on to an encoder in the form of a motion vector 230. Thus, the motion vector information can be used by the encoder to create inter-frame encoded video frames. For example, the video frames may be P or B frame MPEG encoded frames.
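The transformation from world coordinates to a pixel level motion vector can be illustrated with a simple pinhole projection. The camera model, the focal length, the screen center and the function names (project, screen_motion_vector) are assumptions for this sketch. An actual embodiment would use whatever projection the scene graph state specifies.

```python
def project(point, focal_length, center):
    """Project a 3D world point (x, y, z) onto the 2D screen (pinhole model)."""
    x, y, z = point
    cx, cy = center
    # Screen y grows downward, so world y is subtracted.
    return (cx + focal_length * x / z, cy - focal_length * y / z)


def screen_motion_vector(prev_point, cur_point, focal_length=500.0, center=(640.0, 360.0)):
    """Return (displacement, reference_vector) in whole pixels.

    displacement is how far the object moved on screen; reference_vector is
    the offset from the current position back to the reference position,
    which is the direction a block-based motion vector points.
    """
    px, py = project(prev_point, focal_length, center)
    qx, qy = project(cur_point, focal_length, center)
    displacement = (round(qx - px), round(qy - py))
    reference_vector = (-displacement[0], -displacement[1])
    return displacement, reference_vector


# A circle translated by one world unit along x at a depth of ten units.
displacement, reference_vector = screen_motion_vector((0.0, 0.0, 10.0), (1.0, 0.0, 10.0))
```

With the assumed camera parameters, a translation of one world unit at depth ten corresponds to a 50 pixel shift on screen, and the vector handed to the encoder points 50 pixels back toward the reference position.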

In addition to objects moving, scene elements may also change. Thus, a two dimensional representation of information to be displayed on a screen can be ascertained from the three-dimensional scene graph state data. Rectangles can be defined as dirty rectangles, which identify data on the screen that has changed 240. These rectangles can be hashed according to a known formula that will take into account properties of the rectangles 250. The hash value can then be compared to a listing of hash values associated with rectangles that were updated from previous scene graph states 260. The list of hash values may be for the current user session or for other user sessions. Thus, if a request for a change in the content being displayed in an application is received from multiple parties, the redundancy in information being requested can be exploited and processing resources conserved. More specifically, if the hash matches a hash within the searchable memory, encoded graphical data (e.g. either a portion of an entire video frame of encoded data or an entire frame of encoded data) that is linked to the hash value in the searchable memory is retrieved and the data can be combined with other encoded video frames 270.
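The reuse of encoded data across user sessions can be sketched with a simple lookup table keyed by hash value. The FragmentCache class, its method names and the stand-in encoder below are hypothetical. The stand-in merely packs bytes and does not perform MPEG encoding.

```python
class FragmentCache:
    """Table of encoded fragments keyed by the hash of a dirty rectangle."""

    def __init__(self, encode):
        self._encode = encode      # callable that encodes pixel data
        self._table = {}           # hash value -> encoded fragment
        self.encoder_calls = 0     # counts how often the encoder actually ran

    def fragment_for(self, key, pixels):
        """Return the cached fragment for key, encoding only on a miss."""
        fragment = self._table.get(key)
        if fragment is None:
            fragment = self._encode(pixels)
            self._table[key] = fragment
            self.encoder_calls += 1
        return fragment


def fake_encode(pixels):
    """Stand-in for a real fragment encoder (assumption: not MPEG)."""
    return bytes(pixels)


cache = FragmentCache(fake_encode)

# Two user sessions request the same screen update, so the key is the same.
session_a = cache.fragment_for("hash-of-button-on", [16, 32, 48, 64])
session_b = cache.fragment_for("hash-of-button-on", [16, 32, 48, 64])

# A different update misses the table and is encoded once.
session_c = cache.fragment_for("hash-of-button-off", [64, 48, 32, 16])
```

In this sketch the encoder runs twice for three requests, because the second session receives the fragment already stored for the first.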

Additionally, if a rectangle is identified as being dirty and a hash is not identified, the spatial information for that rectangle can be passed to the encoder and the MPEG encoder will spatially encode the data for the rectangle. As used herein, the term content may refer to a dirty rectangle or an object from a scene graph state.

FIG. 3 shows an embodiment of the present invention showing the data flow between an application execution environment 300 and the data flow internal to the video construction engine 310. As previously indicated an application execution environment 300 receives as input an application and the application execution environment 300 executes the application and receives as input user requests for changes to the graphical content that is displayed on a display device associated with the user.

The application execution environment 300 creates a current scene graph 320. The current scene graph may be translated using a library of functions, such as the OpenGL library 330. The resulting OpenGL scene graph state 340 is passed to the video construction engine 310. The OpenGL scene graph state 340 for the current scene graph is compared to a previous scene graph state 350 in a comparison module 360. This may require the calculation and analysis of two-dimensional projections of three-dimensional information that is present within the scene graph state. Such transformations are known by one of ordinary skill in the art. It should be recognized that OpenGL is used herein for convenience and that only the creation of a scene graph state is essential for the present invention. Thus, the scene graph state need not be converted into OpenGL before a scene graph state comparison is performed.

Differences between the scene graphs are noted and dirty rectangles can be identified 370. A dirty rectangle 370 represents a change to an identifiable portion of the display (e.g. a button changing from an on-state to an off-state). There may be more than one dirty rectangle that is identified in the comparison of the scene graph states. Thus, multiple objects within a scene may change simultaneously causing the identification of more than one dirty rectangle.

From the list of dirty rectangles 370, a list of MPEG fragment rectangles (i.e. spatially defined fragments, such as a plurality of macroblocks on macroblock boundaries) can be determined for the dirty rectangle. It should be recognized that the term MPEG fragment rectangle as used in the present context refers to spatial data and not frequency transformed data and is referred to as an MPEG fragment rectangle because MPEG requires a block-based formatting schema i.e. macroblocks that are generally 16×16 pixels in shape. Defining dirty rectangles as MPEG fragment rectangles can be achieved by defining an MPEG fragment rectangle for a dirty rectangle wherein the dirty rectangle is fully encompassed within a selection of macroblocks. Thus, the dirty rectangle fits within a rectangle composed of spatially defined macroblocks. Preferably the dirty rectangles are combined or split to limit the number of MPEG fragment rectangles that are present or to avoid small changes in large rectangles.
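The expansion of a dirty rectangle so that it is fully encompassed by whole macroblocks can be sketched as follows. This is a minimal illustration only, assuming 16×16 macroblocks and an (x, y, width, height) rectangle representation; the function name `snap_to_macroblocks` is hypothetical and is not part of the described system.

```python
MACROBLOCK = 16  # assumed macroblock edge length in pixels (16x16, as mentioned above)


def snap_to_macroblocks(x, y, width, height, mb=MACROBLOCK):
    """Return the smallest macroblock-aligned rectangle (x, y, w, h)
    that fully encloses the given dirty rectangle."""
    left = (x // mb) * mb                   # round the left edge down
    top = (y // mb) * mb                    # round the top edge down
    right = -(-(x + width) // mb) * mb      # round the right edge up (ceiling division)
    bottom = -(-(y + height) // mb) * mb    # round the bottom edge up
    return left, top, right - left, bottom - top


# A 100x50 dirty rectangle at (30, 400) grows to the enclosing macroblock grid.
print(snap_to_macroblocks(30, 400, 100, 50))  # (16, 400, 128, 64)
```

Combining or splitting dirty rectangles before this step, as preferred above, is a policy decision that the sketch leaves out.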

For each MPEG fragment rectangle, a listing of nodes according to z-order (depth) in the scene graph that contributed to the rectangle contents is determined. This can be achieved by omitting nodes that are invisible, have a low opacity, or have a transparent texture.

For each MPEG fragment rectangle, a hash value 382 is created based upon relevant properties of all nodes that have contributed to the rectangle contents (for example absolute position, width, height, transformation matrix, hash of texture bitmap, opacity). If the cache contains an encoded MPEG fragment associated with that hash value, then the encoded MPEG fragment is retrieved from the cache. In the present context, the term encoded MPEG fragment, refers to a portion of a full frame of video that has been encoded according to the MPEG standard. The encoding may simply be DCT encoding for blocks of data or may also include MPEG specific header information for the encoded material. If the calculated hash value does not match an MPEG fragment in the cache, then the dirty rectangle contents (using the scene graph state) are rendered from a three dimensional world view to a two dimensional screen view and the rendered pixel data (i.e. spatial data) are encoded in an encoder, such as an MPEG encoder 385. The encoded MPEG data (e.g. encoded MPEG fragment(s)) for the scene is stored into the cache.
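One possible way to form such a hash value is sketched below. The choice of SHA-256, the byte layout, and the dictionary keys (`x`, `y`, `width`, `height`, `matrix`, `opacity`, `texture_hash`, `z`) are assumptions made for illustration; the source only states that the relevant properties of all contributing nodes enter the hash.

```python
import hashlib
import struct


def fragment_hash(rect, nodes):
    """Hash an MPEG fragment rectangle together with the relevant
    properties of every node that contributes to its contents.

    rect  : (x, y, width, height) in integer pixels
    nodes : list of dicts describing contributing nodes (hypothetical layout)
    """
    h = hashlib.sha256()
    h.update(struct.pack("<4i", *rect))
    # Feed nodes in z-order so that list ordering does not affect the result.
    for node in sorted(nodes, key=lambda n: n["z"]):
        h.update(struct.pack("<4i", node["x"], node["y"],
                             node["width"], node["height"]))
        h.update(struct.pack("<16d", *node["matrix"]))  # 4x4 transformation matrix
        h.update(struct.pack("<d", node["opacity"]))
        h.update(node["texture_hash"])                  # bytes: hash of texture bitmap
    return h.hexdigest()
```

Two fragments whose rectangles and contributing nodes have identical properties map to the same key, so a previously encoded fragment can be looked up instead of being rendered and encoded again.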

As part of the encoding process, the fragment is analyzed to determine whether the encoding can best be performed as ‘inter’ encoding (an encoding relative to the previous screen state) or whether it is encoded as ‘intra’ encoding (an independent encoding). Inter-encoding is preferred in general because it results in less bandwidth and may result in higher quality streams. All changes in nodes between scene graphs are determined including movement, changes of opacity, and changes in texture for example. The system then evaluates whether these changes contribute to a fragment, and whether it is possible to express these changes efficiently into the video codec's primitives. If the evaluation indicates that changes to dominant nodes can be expressed well in the video codec's primitives, then the fragment is inter-encoded. These steps are repeated for every screen update. Since the ‘new scene graph’ will become ‘previous scene graph’ in a next screen update, intermediate results can be reused from previous frames.

FIG. 4 shows an exemplary screen shot 400 of an application that may be rendered on a display device according to the previously described methodology. As shown, the display shows a video frame of the application that has the title “Movie Catalogue” 410. The video frame also includes a static background 420 and also shows a plurality of selectable movie frames 431, 432, 433, 434. Each movie frame is selectable and associated with a separate underlying movie. The movie frames may include one or more full-motion elements (e.g. may display a clip from the movie or a transition of multiple images, or may be movable in a scripted fashion). The video frame 400 includes the titles (431 a, 432 a, 433 a, 434 a) for each of the displayed movies. In the present example, there are four movie frames and associated titles displayed on the current screen. Additionally, the video frame includes a right pointing arrow 440 and a left pointing arrow 450 that when selected provide the user with additional movies that may be selected. This screen shot may be displayed using an application such as a web-browser or another graphical display application such as an application execution environment. It should be understood that the application may reside remote from the client device, wherein video content, such as a sequence of MPEG video frames (e.g. an MPEG elementary stream), is sent from a server to the client device. The video content represents the output display of the application. The server may include the environment for executing the application, and the graphical output is transformed to an MPEG elementary stream in accordance with disclosed embodiments.

FIG. 5 shows a representative DOM tree 500 for the application of FIG. 4. The DOM tree is a document object model representation of the hierarchical objects in a tree structure with associated nodes. A document object model is a cross-platform and language independent convention for representing and interacting with objects in HTML, XHTML and XML documents. The document object model does not include position information, fonts or effects. This information would be included in an associated CSS document (cascading style sheet document). As shown, there are four levels (501-504) to the DOM tree and the nodes entitled “Body” 502 and “list” 503 each include multiple sub-nodes. Thus, the Body node 502 includes the Title, l-arrow, list, and r-arrow objects 510, 511, 512, 513, and the list node includes the cover1, cover2, cover3, and cover4 objects 520, 521, 522, 523. The construction of DOM trees is well known in the art and is typically performed by applications such as web browsers.

FIG. 6 shows an exemplary scene graph model of the application screen shot of FIG. 4 that can be built based upon the DOM tree of FIG. 5. A scene graph is a data structure used for representing both logical and spatial objects for a graphical scene. The complete “scene graph state” also includes the textures, spatial information that describes how the texture is positioned into a 2D or 3D space (e.g. a transformation matrix), and all other attributes that are necessary to render the screen. In an exemplary embodiment using the OpenGL API to interface to WebKit, the spatial information for the present example is a 4×4 matrix that specifies translation (i.e. position of the texture in space), rotation, slanting, shearing, shrinking etc. For simplicity, the following examples use only 2D coordinates, but it should be understood that this could be extended to a 3D transformation matrix. Programs that employ scene graphs include graphics applications (e.g. WebKit, Adobe Acrobat, AutoCAD, CorelDraw, VRML97 etc.), graphics acceleration programs and corresponding graphics acceleration hardware, and additionally 3D applications and games.

The tree-like structure provides a hierarchical representation wherein attributes of parent objects can be attributed to the child objects. The root object represents the entire scene 610, while child nodes of a certain node may contain a decomposition of the parent node into smaller objects. The nodes may contain a texture (bitmap object), a 3D transformation matrix that specifies how the texture is positioned in a 3D space, and/or other graphical attributes such as visibility and transparency. A child node inherits all attributes, transformations, and filters from the parent node.

For example, movement between scene graphs for an object such as the “cover list” 620 would indicate that each of the child objects (cover1, cover2, cover3, and cover4) 621, 622, 623, 624 would also move by an equal amount. As shown, the screen shot of FIG. 4 includes a hierarchy wherein there is a static layer 615, a cover list layer 620, and a background layer 630 and cover1, cover2, cover3, and cover4 are at a sub-layer for the cover list layer. The choice of objects that are associated with a specific layer is performed by the application execution environment, such as in a web browser.
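The inheritance of a parent's movement by its child objects can be illustrated with a small sketch. The `Node` class, the use of parent-relative 2D offsets, and the specific offset chosen for cover1 are assumptions for illustration; an actual scene graph state would carry full transformation matrices and further attributes.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    offset: tuple                      # (dx, dy) relative to the parent (assumed representation)
    children: list = field(default_factory=list)


def world_positions(node, origin=(0, 0)):
    """Accumulate parent offsets to obtain each node's world position."""
    x, y = origin[0] + node.offset[0], origin[1] + node.offset[1]
    positions = {node.name: (x, y)}
    for child in node.children:
        positions.update(world_positions(child, (x, y)))
    return positions


cover_list = Node("cover list", (30, 400), [Node("cover1", (30, 30))])
print(world_positions(cover_list)["cover1"])   # (60, 430)

# Moving only the parent moves every child by the same amount.
cover_list.offset = (40, 400)
print(world_positions(cover_list)["cover1"])   # (70, 430)
```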

FIG. 7 shows a scene graph state with associated screen position information. As shown, the upper left position of each object is provided in scene graph (i.e. world) coordinates. For example, the cover list layer 620 begins at (30, 400), which is 30 pixels in the X direction (assuming standard video X, Y coordinates) and 400 pixels down in the Y direction. This scene graph state allows a web browser or other application that produces a scene graph state to instruct a graphical processing unit or other program, such as embodiments of the invention that include a video construction module as shown and discussed with respect to FIGS. 11 and 12, to render the movie covers 621, 622, 623, 624 including certain effects (shadows, reflections) and to manipulate the position of these objects. The web browser or other application execution environment would then pass the scene graph state and request rendering of the screen. Often the standardized OpenGL API is used for this communication to be able to interface to many different GPUs. The OpenGL API is not only used by web browsers, but by many applications in general, across many operating systems (Linux, Windows, Android).

FIG. 8 shows a previous scene graph state 800 and a current scene graph state 810 where the previous scene graph state is on the left and the current scene graph state is on the right. As shown, in both scene graph states there are three layers, a static layer, a cover list layer, and a background layer that are all coupled to the head node. The cover list layer has an additional four objects (cover1, cover2, cover3 and cover4) at a lower sub-layer. According to embodiments of the invention, the scene graph states are compared, where for example the previous transformation matrix is subtracted from the current transformation matrix. This yields the motion of the objects relative to their previous position. It is thus discovered that cover1, cover2, cover3, and cover4 have moved 10 units in the ‘x’ axis direction (e.g. cover1 moves from 60,430 to 70,430 etc.). It is then determined which macroblocks are covered by the new positions of the covers, and a motion vector is set to (10, 0) for each of these macroblocks.
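The assignment of a motion vector to the macroblocks covered by an object's new position can be sketched as follows. The cover size of 100×150 pixels, the 16×16 macroblock size, and all function names are assumptions for illustration; only the positions (60, 430) and (70, 430) come from the example above.

```python
MB = 16  # assumed macroblock edge length in pixels


def motion_vector(prev_pos, curr_pos):
    """Current position minus previous position."""
    return (curr_pos[0] - prev_pos[0], curr_pos[1] - prev_pos[1])


def covered_macroblocks(x, y, w, h, mb=MB):
    """(column, row) indices of all macroblocks touched by the rectangle."""
    cols = range(x // mb, -(-(x + w) // mb))
    rows = range(y // mb, -(-(y + h) // mb))
    return [(c, r) for r in rows for c in cols]


def motion_field(prev_pos, curr_pos, size):
    """Map each macroblock covered by the object's new position to its motion vector."""
    mv = motion_vector(prev_pos, curr_pos)
    if mv == (0, 0):
        return {}  # no motion detected, nothing to pass to the encoder
    return {mbk: mv for mbk in covered_macroblocks(*curr_pos, *size)}


field = motion_field((60, 430), (70, 430), size=(100, 150))
print(set(field.values()))  # {(10, 0)}
```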

The scene graph comparison between the previous scene graph and the current scene graph may be performed in the following manner wherein the scene graph is transformed from a 3D to a 2D space. A node in a scene graph consists of an object having a texture (2D bitmap) and a transformation describing how that object is positioned in space. It also contains the z-order (the absolute order in which objects are rendered). In OpenGL the transformation consists of a matrix:

$$\begin{pmatrix} m[0] & m[4] & m[8] & m[12] \\ m[1] & m[5] & m[9] & m[13] \\ m[2] & m[6] & m[10] & m[14] \\ m[3] & m[7] & m[11] & m[15] \end{pmatrix}$$

This transformation is applied to an element ‘a’ in a 3D space by matrix multiplication. The element ‘a’ is identified by four points: the origin and the three top positions of the object in x, y and z direction. The rightmost column of the matrix as laid out above (i.e. elements m[12], m[13] and m[14]) specifies translation in 3D space. Elements m[0], m[4], m[8], m[1], m[5], m[9], m[2], m[6], m[10] specify where the three top positions of an object (i.e. the furthest points out in the x, y, z directions) will end up after matrix multiplication. This allows for object or frame rotation, slanting, shearing, shrinking, zooming, translation etc. and repositioning of the object in world space at any time.

When two transformations have been applied to an object according to matrix ‘m’ (from the previous scene graph) and ‘n’ (from the current scene graph) then the “difference” between the two is n-m: matrix subtraction of the previous matrix from the current matrix. The result of the matrix subtraction gives the amount of rotation, slanting, shearing, shrinking, zooming, translation etc. that has been applied to the object between the previous frame and the current frame.
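A minimal sketch of this matrix subtraction is given below, using the 16-element column-major layout shown above and following the convention stated for FIG. 8, in which the previous matrix is subtracted from the current one. The helper names are hypothetical, and only pure translations are constructed here for brevity.

```python
def translation(tx, ty, tz=0.0):
    """4x4 translation matrix as a flat, column-major list m[0]..m[15]."""
    m = [0.0] * 16
    m[0] = m[5] = m[10] = m[15] = 1.0     # identity diagonal
    m[12], m[13], m[14] = tx, ty, tz      # translation components
    return m


def matrix_delta(prev, curr):
    """Element-wise difference: current matrix minus previous matrix."""
    return [c - p for p, c in zip(prev, curr)]


prev = translation(60, 430)   # cover1 in the previous scene graph state
curr = translation(70, 430)   # cover1 in the current scene graph state
delta = matrix_delta(prev, curr)
print(delta[12], delta[13])   # 10.0 0.0
```

For a pure translation only elements 12 to 14 of the difference are non-zero; non-zero entries elsewhere would indicate rotation, scaling or shearing between the two states.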

Projecting a 3D image to a 2D surface is well known in the art. In one embodiment, the system first calculates projections of the 3D scene graphs onto a 2D plane, where the transformation matrices also become 2D. The motion vector (obtained by subtracting the transformation matrices) is then 2D and can be directly applied by the MPEG encoder. One motion vector per (destination) macroblock is passed, if motion was detected. The motion vector has a defined (x, y) direction, having a certain length that indicates direction and distance covered between the current frame and the previous frame. The encoder then assumes that the reference information for a macroblock is located in the reverse direction of the motion vector. If no motion was detected, then either the macroblock did not change, or it changed entirely and then it is intra-encoded.
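The statement that the reference information lies in the reverse direction of the motion vector can be made concrete with a short sketch. The 16×16 macroblock size and the function name `reference_origin` are assumptions; the sketch only shows the coordinate arithmetic, not an encoder's actual interface.

```python
def reference_origin(mb_col, mb_row, mv, mb=16):
    """Top-left pixel of the reference area in the previous frame for a
    destination macroblock, found by stepping back along the motion vector."""
    dst_x, dst_y = mb_col * mb, mb_row * mb
    return dst_x - mv[0], dst_y - mv[1]


# A macroblock at column 5, row 27 that moved 10 pixels to the right
# takes its reference from 10 pixels further left in the previous frame.
print(reference_origin(5, 27, (10, 0)))  # (70, 432)
```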

FIG. 9 is an exemplary motion field that shows all of the motion vectors for macroblocks in a scene wherein all of the macroblocks have moved 10 units to the right. This might happen in a scrolling scenario where a user provides user input wanting to move elements on the display screen to the right. The user may be viewing a television or other device and may send a control signal to the server that is indicative of a right arrow key or a right-ward swipe. This control signal is received by the system and the control signal is used to generate a scene graph update within the Application Execution Environment. Once a scene graph is created, the video construction module and the internal components of the video construction module create an encoded video signal that is transmitted from the server to the client device and then displayed on the client device. The provided motion field is the result of the scene graph state comparison between the previous and current scene graph states wherein the transformation matrices are subtracted.

FIG. 10 shows a motion field for the rotation of an image. For this example, the transformation matrices of the previous and current scene graph states are subtracted and the motion vectors indicate that there is a rotation of the objects within the image. Note that the macroblocks themselves are not rotated; consequently, there will be a residual error after the motion has been compensated. Thus, residual error calculations as are known in the art for motion vectors may be calculated. The residual error may be considered to be graphical information. This may be performed by the MPEG encoder or by the video construction module. Slanting, shearing, and other movements will result in other motion fields.

Hashing and caching of dirty rectangles on individual layers of a scene graph state is more efficient compared to hashing and caching of 2D projection of these layers, because the layers represent independent changes.

It should be noted that some Application Execution Environments might use one ‘background’ layer where it renders objects for which it chooses not to create a separate render layer. This could be a wall clock, for example. Changes to this layer are analyzed resulting in one or more dirty rectangles. In principle all rectangles depend on the background (if the background changes, parts of the background are likely visible in the rectangle due to the macroblock snapping). To avoid the background being part of every rectangle's hash function, and thus to avoid a re-rendering and re-encoding of all rectangles when the background changes (e.g. when the seconds hand moves in the wall clock object), the background is excluded from the scene graph and it is not available as an MPEG fragment.

FIG. 11 shows three typical embodiments of application execution engines using proprietary or standardized APIs to interact with the video construction engine. 1101 is a DOM-based application execution engine, such as for example the WebKit application execution engine. A DOM-based application execution engine may use a scene graph module (1104) to map render layers to the video construction engine (1107). Another embodiment may be a non-DOM-based application execution engine (1102) that may use a 2D API to interface to a 2D API module (1105) to interact with the video construction engine. In other embodiments, games (1103) may interact through an OpenGL API with an OpenGL module (1106) that translates OpenGL primitives to calls into the video construction engine. In all presented embodiments, the API providers may interface with the video construction engine through a common API (1108).

FIG. 12 shows a flow chart for implementing the functionality of the video construction module, in some embodiments. Input to the comparison step 1201 is a current scene state and a previous scene state. The comparison yields a list of objects that have changed and the delta of the object's changed properties, such as for example the object's position, transformation matrix, texture, translucency, etc., or any objects that have been added or removed from the current scene.
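The comparison of step 1201 can be sketched as a function that takes two scene states and returns the added, removed and changed objects together with the changed properties. Representing a scene state as a dictionary of object identifiers mapped to property dictionaries is an assumption made for this illustration.

```python
def compare_scene_states(previous, current):
    """Return a list of (object_id, kind, detail) tuples, where kind is
    'added', 'removed' or 'changed'. For 'changed', detail maps each
    changed property to a (previous_value, current_value) pair."""
    changes = []
    for obj_id in sorted(previous.keys() | current.keys()):
        before, after = previous.get(obj_id), current.get(obj_id)
        if before is None:
            changes.append((obj_id, "added", after))
        elif after is None:
            changes.append((obj_id, "removed", before))
        else:
            delta = {
                key: (before.get(key), after.get(key))
                for key in sorted(before.keys() | after.keys())
                if before.get(key) != after.get(key)
            }
            if delta:  # unchanged objects are not reported
                changes.append((obj_id, "changed", delta))
    return changes


previous = {"cover1": {"position": (60, 430), "opacity": 1.0}}
current = {"cover1": {"position": (70, 430), "opacity": 1.0}}
print(compare_scene_states(previous, current))
# [('cover1', 'changed', {'position': ((60, 430), (70, 430))})]
```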

Since embodiments of the invention may maintain objects in a 2 dimensional coordinate system, 2 dimensional (flat) objects in a 3 dimensional coordinate system or a full 3 dimensional object model in a 3 dimensional coordinate system, a mapping has to be made for each object from the scene's coordinate system to the current and previous field of view. The field of view is the extent of the observable scene at a particular moment. For each object on the list of changed, added or removed objects it is determined in step 1202 whether the object's change, addition or removal was visible in the field of view of the scene's current state or the field of view of the previous state and what bounding rectangle represented that change in said states.

Bounding rectangles pertaining to the objects' previous and current states may overlap in various constellations. Fragments, however, cannot overlap and before any fragments can be identified, overlapping conditions have to be resolved. This is done in step 1203 by applying a tessellation or tiling process as depicted in FIG. 13.

Suppose that overlapping rectangles 1301 for object A and 1302 for object B as depicted by FIG. 13 are processed by the tessellation process. Both rectangles are first expanded to only include complete macroblocks, so that the resulting tessellations can be used to derive fragments. The tessellation process then applies an algorithm that yields non-overlapping rectangles which combined are the equivalent of the union of the original rectangles. A possible outcome of the tessellation process for the constellation as depicted in FIG. 13 is depicted in FIG. 14. In this case, the process yields 3 rectangles: 1401 to 1403. These rectangles are non-overlapping and macroblock aligned. Combined they are equivalent to the overlapping rectangles of FIG. 13. The reader will appreciate that this is just one example of many equivalent tessellation results and that the preferred tessellation may depend on policies. One policy, for example, is that rectangles representing the current state should have preference over rectangles that represent the previous state.
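One simple tessellation algorithm, offered here only as a sketch and not as the algorithm used for FIG. 14, cuts the union of the rectangles into horizontal bands at every distinct top or bottom edge and merges the covered x-intervals within each band. Rectangles are assumed to be given as (x0, y0, x1, y1) corner coordinates that are already macroblock aligned; no preference policy between current and previous state is modelled.

```python
def tessellate(rects):
    """Split possibly overlapping rectangles (x0, y0, x1, y1) into
    non-overlapping rectangles that together cover the same area."""
    ys = sorted({y for r in rects for y in (r[1], r[3])})
    result = []
    for y0, y1 in zip(ys, ys[1:]):
        # x-intervals of every rectangle that spans this horizontal band
        spans = sorted((r[0], r[2]) for r in rects if r[1] <= y0 and r[3] >= y1)
        merged = []
        for x0, x1 in spans:
            if merged and x0 <= merged[-1][1]:
                merged[-1][1] = max(merged[-1][1], x1)  # overlaps or touches: extend
            else:
                merged.append([x0, x1])
        result.extend((x0, y0, x1, y1) for x0, x1 in merged)
    return result


a = (0, 0, 32, 32)     # hypothetical rectangle for object A
b = (16, 16, 48, 48)   # hypothetical rectangle for object B
print(tessellate([a, b]))
# [(0, 0, 32, 16), (0, 16, 48, 32), (16, 32, 48, 48)]
```

Because the band boundaries are taken from macroblock-aligned inputs, every output rectangle is macroblock aligned as well.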

Returning to step 1203, the tessellation process is first applied to the rectangles pertaining to the objects' previous states. When an object changes position or its transformation matrix changes, graphical data may be revealed that was obscured in the previous state. The object's new bounding rectangle usually only partially overlaps with the object's previous bounding rectangle. A fragment has to be made that encodes this exposure. Therefore, step 1203 first applies the tessellation process to all bounding rectangles of the objects' previous states. Subsequently, the bounding rectangles of the objects' current states are added to the tessellation process. The resulting rectangles represent the fragments that constitute the update from the previous scene state to the current scene state. Steps 1204 to 1208 are performed for each fragment.

Step 1204 determines the fragment's properties, such as whether the fragment is related to the current state or the previous state, which objects contribute to the fragment's pixel representation, and which contributing object is the dominant object. If an object dominates the fragment's pixel representation, the object's rectangle pertaining to the previous state is used as a reference window for temporal reference and the fragment may be inter encoded. If multiple objects dominate the fragment's representation a union of multiple previous state rectangles may be used as a reference window. Alternatively, the fragment's current bounding rectangle may be used as a reference window.

The fragments' properties as determined by step 1204 are used in step 1205 to form hash values that uniquely describe the fragment. A hash value typically includes the coordinates of the fragment's rectangle, the properties of contributing objects and encoding attributes that may be used to distinguish encoder specific variants of otherwise equivalent fragments such as profile, level or other codec specific settings, differences in quantization, use of the loop filter, etc. If the fragment has a reference window, the hash is extended with the coordinates of the reference window in pixel units, the properties of the objects contributing to the reference window and the transformation matrix of the dominant object. All in all, the hash uniquely describes the fragment that encodes the scene's current state for the fragment's rectangle and, if a temporal relation could be established, a transition from the scene's previous state to the current state.

In step 1206 the hash uniquely identifying the fragment is checked against a hash table. If the hash cannot be found in the hash table, the fragment description is forwarded to the fragment encoding module and step 1207 is applied. If the hash is found in the hash table, the associated encoded fragment is retrieved from the fragment caching module and step 1208 is applied.

In step 1207 fragments are encoded from pixel data pertaining to the current scene state and, if available, pixel data pertaining to the previous scene state and meta data obtained from the scene's state change (such as for example the type of fragment, transformation matrices of the objects contributing to the fragment, changes in translucency of the objects) into a stitchable fragment. It is in this step that many efficiency and quality improvements are achieved. Many steps in the encoding process, such as the intra/inter decision, selection of partitions, motion estimation and weighted prediction parameters benefit from the meta data because it allows for derivation of the spatial or temporal relations relevant for the encoding process. Examples of such benefits are provided in the remainder of this document. Once a fragment has been encoded the fragment is stored in the fragment caching module and step 1208 is applied.

Step 1208 forwards stitchable fragments to the stitching module.
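Steps 1206 to 1208 amount to a look-up with an encode-on-miss path, which can be sketched as follows. The dictionary used as hash table and fragment cache, and the `encode` and `stitch` callables standing in for the fragment encoding module and the stitching module, are hypothetical placeholders.

```python
def fetch_or_encode(fragment_hash, description, cache, encode, stitch):
    """Look up a fragment by hash, encode it on a miss, and forward it."""
    encoded = cache.get(fragment_hash)      # step 1206: check the hash table
    if encoded is None:
        encoded = encode(description)       # step 1207: encode the fragment
        cache[fragment_hash] = encoded      # store it for later reuse
    stitch(encoded)                         # step 1208: forward to the stitcher
    return encoded


# Usage with stand-in encoder and stitcher:
cache, stitched = {}, []
fake_encode = lambda desc: b"encoded:" + desc.encode()
fetch_or_encode("h1", "fragment A", cache, fake_encode, stitched.append)
fetch_or_encode("h1", "fragment A", cache, fake_encode, stitched.append)  # served from cache
print(len(cache), len(stitched))  # 1 2
```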

It should be noted that objects are generally handled as an atomic entity, except for the background object. The background object is a fixed object at infinite distance that spans the entire field of view. A consequence of treating the background as an atomic entity would be that small changes to the background would potentially permeate into the hash values of all fragments in which the background is visible. Therefore, the background texture is treated in the same way as disclosed in U.S. application Ser. No. 13/445,104 (Graphical Application Integration with MPEG Objects), the contents of which are hereby incorporated by reference, and changes to the background only have consequences for the fragments overlapping the dirty rectangles of the background.

The following examples relate to a DOM-based application embodiment equivalent to FIG. 11 component 1101, 1104 and 1107.

The present invention may be embodied in many different forms, including, but in no way limited to, computer program logic for use with a processor (e.g., a microprocessor, microcontroller, digital signal processor, or general purpose computer), programmable logic for use with a programmable logic device (e.g., a Field Programmable Gate Array (FPGA) or other PLD), discrete components, integrated circuitry (e.g., an Application Specific Integrated Circuit (ASIC)), or any other means including any combination thereof. In an embodiment of the present invention, predominantly all of the reordering logic may be implemented as a set of computer program instructions that is converted into a computer executable form, stored as such in a computer readable medium, and executed by a microprocessor within the array under the control of an operating system.

Computer program logic implementing all or part of the functionality previously described herein may be embodied in various forms, including, but in no way limited to, a source code form, a computer executable form, and various intermediate forms (e.g., forms generated by an assembler, compiler, networker, or locator.) Source code may include a series of computer program instructions implemented in any of various programming languages (e.g., an object code, an assembly language, or a high-level language such as Fortran, C, C++, JAVA, or HTML) for use with various operating systems or operating environments. The source code may define and use various data structures and communication messages. The source code may be in a computer executable form (e.g., via an interpreter), or the source code may be converted (e.g., via a translator, assembler, or compiler) into a computer executable form.

The computer program may be fixed in any form (e.g., source code form, computer executable form, or an intermediate form) either permanently or transitorily in a tangible storage medium, such as a semiconductor memory device (e.g., a RAM, ROM, PROM, EEPROM, or Flash-Programmable RAM), a magnetic memory device (e.g., a diskette or fixed disk), an optical memory device (e.g., a CD-ROM), a PC card (e.g., PCMCIA card), or other memory device. The computer program may be fixed in any form in a signal that is transmittable to a computer using any of various communication technologies, including, but in no way limited to, analog technologies, digital technologies, optical technologies, wireless technologies, networking technologies, and inter-networking technologies. The computer program may be distributed in any form as a removable storage medium with accompanying printed or electronic documentation (e.g., shrink wrapped software or a magnetic tape), preloaded with a computer system (e.g., on system ROM or fixed disk), or distributed from a server or electronic bulletin board over the communication system (e.g., the Internet or World Wide Web.)

Hardware logic (including programmable logic for use with a programmable logic device) implementing all or part of the functionality previously described herein may be designed using traditional manual methods, or may be designed, captured, simulated, or documented electronically using various tools, such as Computer Aided Design (CAD), a hardware description language (e.g., VHDL or AHDL), or a PLD programming language (e.g., PALASM, ABEL, or CUPL.)

While the invention has been particularly shown and described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended clauses. As will be apparent to those skilled in the art, techniques described above for panoramas may be applied to images that have been captured as non-panoramic images, and vice versa.

Embodiments of the present invention may be described, without limitation, by the following clauses. While these embodiments have been described in the clauses by process steps, an apparatus comprising a computer with associated display capable of executing the process steps in the clauses below is also included in the present invention. Likewise, a computer program product including computer executable instructions for executing the process steps in the clauses below and stored on a computer readable medium is included within the present invention. 

What is claimed is:
 1. A method for creating a composited video frame sequence, comprising: at a system including one or more processors and memory storing instructions for execution by the processor: comparing a current scene state with a previous scene state, wherein the current scene state includes a first plurality of objects having respective properties, and wherein the previous scene state includes a second plurality of objects having respective properties; detecting a difference between the respective properties of the first plurality of objects and the respective properties of the second plurality of objects; in accordance with the difference between the respective properties being detected, retrieving one or more pre-encoded first video fragments based on the detected difference, wherein each of the one or more pre-encoded first video fragments is a portion of a full frame of video; and compositing the video frame sequence, wherein the video frame sequence includes at least one of the one or more pre-encoded first video fragments.
 2. The method of claim 1, wherein the one or more pre-encoded first video fragments are retrieved from a memory.
 3. The method of claim 2, wherein the memory is non-volatile memory.
 4. The method of claim 2, wherein the memory is volatile memory.
 5. The method of claim 1, wherein the difference detected between the respective properties of the first plurality of objects and the respective properties of the second plurality of objects corresponds to at least one property from a group consisting of: a position, transformation matrix, texture, and translucency, of a respective object.
 6. The method of claim 1, wherein detecting the difference between the respective properties of the first plurality of objects and the respective properties of the second plurality of objects includes: tessellating a first bounding rectangle, corresponding to at least one object of the first plurality of objects, with a second bounding rectangle, corresponding to at least one object of the second plurality of objects.
 7. The method of claim 1, the method further comprising: in accordance with the difference between the respective properties being detected, encoding one or more second video fragments based on the detected difference, wherein the video frame sequence further includes at least one of the one or more encoded second video fragments.
 8. A computer system for creating a composited video frame sequence for an application, comprising: one or more processors; memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for: comparing a current scene state with a previous scene state, wherein the current scene state includes a first plurality of objects having respective properties, and wherein the previous scene state includes a second plurality of objects having respective properties; detecting a difference between the respective properties of the first plurality of objects and the respective properties of the second plurality of objects; in accordance with the difference between the respective properties being detected, retrieving one or more pre-encoded first video fragments based on the detected difference, wherein each of the one or more pre-encoded first video fragments is a portion of a full frame of video; and compositing the video frame sequence, wherein the video frame sequence includes at least one of the one or more pre-encoded first video fragments.
 9. A non-transitory computer readable storage medium, storing one or more programs for execution by one or more processors of a computer system, the one or more programs including instructions for: comparing a current scene state with a previous scene state, wherein the current scene state includes a first plurality of objects having respective properties, and wherein the previous scene state includes a second plurality of objects having respective properties; detecting a difference between the respective properties of the first plurality of objects and the respective properties of the second plurality of objects; in accordance with the difference between the respective properties being detected, retrieving one or more pre-encoded first video fragments based on the detected difference, wherein each of the one or more pre-encoded first video fragments is a portion of a full frame of video; and compositing the video frame sequence, wherein the video frame sequence includes at least one of the one or more pre-encoded first video fragments. 
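
By way of illustration only, and without limiting the claims, the flow recited in claims 1 and 7 (compare scene states, detect a property difference, retrieve a pre-encoded fragment or encode a new one, and composite) might be sketched in Python as follows. Every name here (SceneObject, detect_differences, FragmentCache, composite_frame, fake_encode) is a hypothetical chosen for this sketch and does not appear in the specification. The encoder is a stand-in that returns placeholder bytes rather than real MPEG data, and the "composited frame" is simply a list of fragments rather than a stitched bitstream.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, List, Optional, Tuple


@dataclass(frozen=True)
class SceneObject:
    """Object properties of the kind listed in claim 5 (field names assumed)."""
    position: Tuple[int, int]
    transform: Tuple[float, ...]   # flattened transformation matrix
    texture: str                   # texture identifier
    translucency: float


# A scene state maps an object identifier to that object's properties.
SceneState = Dict[str, SceneObject]
Change = Tuple[Optional[SceneObject], Optional[SceneObject]]


def detect_differences(current: SceneState, previous: SceneState) -> Dict[str, Change]:
    """Return {object id: (previous properties, current properties)} for every
    object that was added, removed, or changed between the two states."""
    changed: Dict[str, Change] = {}
    for obj_id in sorted(current.keys() | previous.keys()):
        before, after = previous.get(obj_id), current.get(obj_id)
        if before != after:
            changed[obj_id] = (before, after)
    return changed


class FragmentCache:
    """Holds encoded fragments keyed by the detected difference.

    A hit corresponds to retrieving a pre-encoded fragment (claim 1);
    a miss corresponds to encoding a new fragment (claim 7)."""

    def __init__(self, encode: Callable[[Hashable], bytes]) -> None:
        self._encode = encode
        self._store: Dict[Hashable, bytes] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: Hashable) -> bytes:
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._encode(key)
        return self._store[key]


def fake_encode(key: Hashable) -> bytes:
    """Stand-in encoder (assumption): returns placeholder bytes, not MPEG."""
    return repr(key).encode("utf-8")


def composite_frame(current: SceneState, previous: SceneState,
                    cache: FragmentCache) -> List[bytes]:
    """Collect one fragment per changed object; an unchanged scene yields []."""
    diffs = detect_differences(current, previous)
    return [cache.get((obj_id, before, after))
            for obj_id, (before, after) in diffs.items()]
```

In this sketch, a scene transition that has been seen before produces the same cache key, so the second request for it is served from the cache without calling the encoder again. Keying on the pair of previous and current properties is one possible design choice for illustration, not the keying scheme of the described embodiments.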